client: Change connectivity state to CONNECTING when creating the name resolver #8710

easwars · 2025-11-14T21:48:06Z

Fixes #7686

Current Behavior

When the client exits IDLE and creates the name resolver, it stays in IDLE until the LB policy sets the connectivity state.
When exiting IDLE mode (because Connect was called or because of an RPC), if name resolver creation fails, the channel stays in IDLE.

New Behavior

When the client exits IDLE and creates the name resolver, it moves to CONNECTING. From then on, the LB policy sets the connectivity state.
When exiting IDLE mode (because Connect was called or because of an RPC), we have already moved to CONNECTING (as described in the previous point). If name resolver creation fails, we move to TRANSIENT_FAILURE, start the idle timer, and move back to IDLE when the timer fires.

RELEASE NOTES:

client: Change connectivity state to CONNECTING when creating the name resolver (as part of exiting IDLE).
client: Change connectivity state to TRANSIENT_FAILURE if name resolver creation fails.
client: Change connectivity state to IDLE after idle timeout expires (also when current state is TRANSIENT_FAILURE).


codecov · 2025-11-14T21:51:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.25%. Comparing base (112ec12) to head (5273e8a).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8710      +/-   ##
==========================================
- Coverage   83.28%   83.25%   -0.04%     
==========================================
  Files         416      419       +3     
  Lines       32267    32433     +166     
==========================================
+ Hits        26874    27002     +128     
- Misses       4019     4047      +28     
- Partials     1374     1384      +10

Files with missing lines	Coverage Δ
clientconn.go	`90.90% <100.00%> (+0.77%)`	⬆️
internal/idle/idle.go	`89.28% <100.00%> (+0.12%)`	⬆️
resolver_wrapper.go	`92.45% <ø> (ø)`
stream.go	`81.83% <100.00%> (ø)`

... and 20 files with indirect coverage changes


dfawley · 2025-11-20T22:11:20Z

dial_test.go


 func (s stringerVal) String() string { return s.s }
+
+const errResolverBuildercheme = "test-resolver-build-failure"


dfawley · 2025-11-20T22:27:38Z

resolver_wrapper.go

+		// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
+		// defines CONNECTING as follows:
+		// - The channel is trying to establish a connection and is waiting to
+		//   make progress on one of the steps involved in name resolution, TCP
+		//   connection establishment or TLS handshake. This may be used as the
+		//   initial state for channels upon creation.
+		//
+		// We are starting the name resolver here as part of exiting IDLE, so
+		// transitioning to CONNECTING is the right thing to do.


IMO comments should be short and to the point.

Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.

Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:

// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.

dfawley · 2025-11-20T22:33:09Z

test/clientconn_state_transition_test.go

+			if state := cc.GetState(); state != connectivity.Idle {
+				t.Fatalf("Expected initial state to be IDLE, got %v", state)
+			}


The AwaitState above already tested this IIUC

dfawley · 2025-11-20T22:33:59Z

test/clientconn_state_transition_test.go

+			// Ensure that the client is in IDLE before connecting.
+			ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
+			defer cancel()
+			testutils.AwaitState(ctx, t, cc, connectivity.Idle)


This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.

That's true. Moved the check for the current state to here, and got rid of the Await.

dfawley · 2025-11-20T22:35:18Z

test/clientconn_state_transition_test.go

+			for _, wantState := range wantStates {
+				waitForState(ctx, t, stateCh, wantState)
+				if wantState == connectivity.Idle {
+					tt.exitIdleFunc(ctx, cc)


Can we make this test the actual RPC error when we use an RPC to exit idle?

Done. I changed the test a little to make that happen.

But as part of doing that, I realized a thing or two:

The RPC is not failing with a status error because idle.Manager.ExitIdleMode, which is called when an RPC has to kick the channel out of IDLE, does not embed the error returned from ClientConn.ExitIdleMode. If we make it embed the error, we will have to return a status error from the latter, which is doable, but that raises the following questions:

What code do we return? I'm torn between Unavailable and Internal, and leaning towards the latter.

This would also make Dial fail with a status error which I find a little odd.

Thoughts?

Maybe Unavailable and then it's treated the same as if the resolver did a ReportError immediately instead of failing to build? But probably we should see what the C++/Java lame/failing channels do before deciding, since that seems like the most equivalent scenario in languages where the resolver can't fail to build -- it just doesn't exist.

Switched to Unavailable. Also changed the idleness manager code to embed the error, so that the RPC can see the status code.

This does mean that the error returned from Dial in this scenario would be a status error. But I guess that's an OK thing to do, unless you feel otherwise. In which case, we can explicitly return a new error here and thereby remove the embedding:

grpc-go/clientconn.go

Line 266 in 76c67d1

return nil, err

I think I would prefer to have the idlenessmgr continue to return the resolver error directly, and have the RPC path return an Unavailable status with that error as the message.

Now that I look more closely, what happens to the second RPC after this happens, with the current state of your code? It seems it will only do anything with the idlenessMgr on the initial exit from idle (here, which short-circuits if not idle here), and nothing ever produces a picker when the initial exit from idle mode fails, so I think the channel will just go to the picker wrapper and wait for a picker until the RPC times out.

Switched to not embedding the error from the idleness manager and handling the status code in the rpc path.

Also, the second RPC works just fine (i.e., that also returns Unavailable just like the first one), because when exiting IDLE fails, we undo the idle entry process. So, the next RPC behaves just like the first RPC.

What does that mean? I thought the goal was to stay in TF until the idle timer expires, not go immediately back to IDLE and rebuild the resolver on the next call?

What does that mean?

By "we undo the idle entry process" I meant the internals of the idleness manager in terms of the state being tracked etc. We will move to TF at the channel level.

OK I'm confused.

How does the error get from the name resolver to the RPC after we've left idle? I don't see where we're saving it in an RPC picker, e.g., or where the idleness manager is storing the "last error encountered while trying to leave idle" so that it can return it on subsequent RPCs.

Something about this approach seems wrong to me.

I think ExitIdleMode should be infallible. Instead, if the resolver builder fails, the channel should still leave idle mode, and it should make sure its state is set such that future RPCs will fail.

resolver_wrapper.go

dfawley · 2025-11-25T21:21:27Z

resolver_balancer_ext_test.go

 // Tests the case where the resolver reports an error to the channel before
 // reporting an update. Verifies that the channel eventually moves to
-// TransientFailure and a subsequent RPC returns the error reported by the
+// TransientFailure and a subsequent RPCs returns the error reported by the


dfawley · 2025-11-25T22:46:15Z

test/clientconn_state_transition_test.go

+			for _, wantState := range wantStates {
+				waitForState(ctx, t, stateCh, wantState)
+				if wantState == connectivity.Idle {
+					tt.exitIdleFunc(ctx, cc)


OK I'm confused.

How does the error get from the name resolver to the RPC after we've left idle? I don't see where we're saving it in an RPC picker, e.g., or where the idleness manager is storing the "last error encountered while trying to leave idle" so that it can return it on subsequent RPCs.

dfawley · 2025-11-25T22:49:58Z

resolver_wrapper.go

 			Authority:            ccr.cc.authority,
 			MetricsRecorder:      ccr.cc.metricsRecorderList,
 		}
+


Please revert this diff & file.

client: move connectivity state to CONNECTING when creating the name resolver

1cb398d

easwars requested a review from dfawley November 14, 2025 21:48

easwars assigned dfawley Nov 14, 2025

easwars added the Type: Bug and Area: Client labels Nov 14, 2025

easwars added this to the 1.78 Release milestone Nov 14, 2025

easwars requested a review from arjan-bal November 14, 2025 21:48

easwars assigned arjan-bal Nov 14, 2025

make vet happy

86503d2

dfawley reviewed Nov 20, 2025


easwars added 8 commits November 21, 2025 07:10

fix typo

f496de6

shorten comment, move error handling and setting state to the channel

36f14e6

check for IDLE after channel creation without awaiting

0131893

ensure resolver build error is returned to RPC

bd0ee2b

add a test to verify the case where resolver reports error

a88158b

return Unavailable when resolver build fails

a7085d5

return Unavailable only in the RPC path

7c12dde

make more than one RPC

5273e8a

dfawley reviewed Nov 25, 2025



